Model Report for EPC:tabular

Generated on 30 Jun 2025, 17:38   ●   3,374 original samples, 3,374 synthetic samples

Accuracy 83.4% (90.2%)
Univariate 94.8% (97.4%)
Bivariate 85.8% (92.2%)
Trivariate 69.7% (81.0%)
Similarity
Cosine Similarity 0.94544 (0.99230)
Discriminator AUC 95.1% (50.0%)
Distances
Identical Matches 0.0% (0.0%)
Average Distances 0.254 (0.256)
DCR Share 53.4% (50.0%)
NNDR Ratio 1.025 (1.000)

Correlations

Univariate Distributions

Bivariate Distributions

Accuracy

Column Univariate Bivariate Trivariate
EPt 98.1% 90.5% 77.2%
yie 97.9% 88.7% 70.7%
EPc 97.3% 90.2% 76.0%
total_opaque_surface 97.3% 82.9% 65.9%
latitude 97.2% 87.1% 69.0%
system_type 97.1% 90.3% 76.2%
average_opaque_surface_transmittance 97.0% 87.7% 69.8%
average_glazed_surface_transmittance 96.9% 87.9% 70.2%
total_glazed_surface 96.8% 84.1% 67.3%
floors 96.8% 90.8% 77.5%
EPw 96.5% 89.2% 71.5%
longitude 96.5% 86.5% 68.7%
Cm 96.5% 84.2% 67.6%
EPgl 95.9% 87.3% 69.0%
Asol 95.9% 85.1% 68.5%
nominal_power 95.7% 85.1% 68.7%
EPv 95.4% 90.1% 77.0%
QHimp 95.2% 84.1% 67.3%
degree_days 94.9% 86.6% 69.0%
QHnd 94.9% 83.8% 67.1%
air_changes 94.7% 87.8% 70.6%
heated_gross_volume 94.6% 80.9% 63.9%
heat_loss_surface 94.5% 79.0% 61.4%
EPh 94.4% 86.0% 68.2%
installation_year 94.3% 88.2% 71.0%
cooled_gross_volume 93.9% 85.2% 70.4%
surface_to_volume_ratio 93.8% 85.4% 68.9%
total_effective_ventilation_flow 93.7% 86.8% 70.7%
net_area 93.6% 78.7% 61.8%
EPl 93.5% 87.4% 70.4%
heated_usable_area 93.1% 79.0% 62.2%
ventilation_type 91.7% 89.1% 78.9%
construction_year 91.6% 86.0% 69.6%
DPR412_classification 90.8% 86.0% 71.7%
cooled_usable_area 80.5% 75.5% 64.1%
Total 94.8% (97.4%) 85.8% (92.2%) 69.7% (81.0%)

Explainer
Accuracy of the synthetic data is assessed by comparing the distributions of the synthetic data (shown in green) with those of the original data (shown in gray). For each distribution plot, the deviations are summed across all categories to obtain the total variation distance (TVD), and the accuracy is reported as 100% - TVD. These accuracies are calculated for all univariate, bivariate and trivariate distributions. The final accuracy score is the average of the univariate, bivariate and trivariate accuracies; in this report, (94.8% + 85.8% + 69.7%) / 3 = 83.4%.
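The accuracy calculation can be illustrated with a minimal Python sketch for a single categorical column. It is not the report generator's implementation. The function name `tvd_accuracy` and the toy samples are hypothetical, and the sketch assumes the standard definition of TVD (half the sum of absolute frequency differences), since the report does not spell out the normalization. Numeric columns would first need to be binned, which is omitted here.

```python
from collections import Counter


def tvd_accuracy(original, synthetic):
    """Return 1 - TVD between the category frequencies of two samples."""
    n_orig, n_syn = len(original), len(synthetic)
    freq_orig = Counter(original)
    freq_syn = Counter(synthetic)
    categories = set(freq_orig) | set(freq_syn)
    # Assumption: TVD is half the sum of absolute differences between
    # relative frequencies, which keeps it in the range [0, 1].
    tvd = 0.5 * sum(
        abs(freq_orig[c] / n_orig - freq_syn[c] / n_syn) for c in categories
    )
    return 1.0 - tvd


# Hypothetical toy samples for one categorical column.
original = ["A", "A", "B", "C"]
synthetic = ["A", "B", "B", "C"]
accuracy = tvd_accuracy(original, synthetic)

# Overall score as the mean of the three totals listed in this report.
overall = (0.948 + 0.858 + 0.697) / 3
```

For the toy samples, the frequencies differ by 0.25 for A and for B, so the TVD is 0.25 and the accuracy is 75%.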

Similarity


Explainer
These plots show the first three principal components of the training samples, the synthetic samples, and (if available) the holdout samples within the embedding space. The black dots mark the centroids of the respective sample sets. The similarity metric is the cosine similarity between these centroids. A value close to 1 is expected, indicating that the synthetic samples are as similar to the training samples as the holdout samples are.
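The centroid comparison can be sketched in a few lines of Python. This is an illustration only: how the embeddings are produced is not described in the report, so the two-dimensional vectors and the helper names `centroid` and `cosine_similarity` below are hypothetical.

```python
import math


def centroid(vectors):
    """Return the component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings; real ones would come from an embedding model.
training_embeddings = [[1.0, 0.0], [0.0, 1.0]]
synthetic_embeddings = [[2.0, 0.0], [0.0, 2.0]]

similarity = cosine_similarity(
    centroid(training_embeddings), centroid(synthetic_embeddings)
)
```

Because cosine similarity ignores vector length, the two centroids above, which point in the same direction, score approximately 1 even though their magnitudes differ.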

Distances

Metric Synthetic vs. Training Synthetic vs. Holdout Training vs. Holdout
Identical Matches 0.0% 0.0% 0.3%
DCR Average 0.254 0.256 0.152
NNDR Min10 0.538 0.524 0.311
DCR Share 53.4% of synthetic samples are closer to a training than to a holdout sample
NNDR Ratio 1.025 = (NNDR Min10 of Synthetic vs. Training) / (NNDR Min10 of Synthetic vs. Holdout)

Explainer
Synthetic data should be as close to the original training samples as it is to the original holdout samples, which serve as a reference. This can be checked empirically by measuring the distance from each synthetic sample to its closest original sample, with the training and holdout sets sampled to be of equal size. A green line that lies significantly to the left of the dark gray line implies that synthetic samples are closer to the training samples than to the holdout samples, indicating that the model has overfitted to the training data. A green line that overlays the dark gray line confirms that the trained model represents general rules that hold in the training samples just as well as in the holdout samples. The DCR (distance to closest record) share is the proportion of synthetic samples that are closer to a training sample than to a holdout sample; ideally it should not significantly exceed 50%, as a higher value could indicate overfitting. The NNDR (nearest neighbor distance ratio) ratio is the 10th smallest NNDR for synthetic vs. training divided by the 10th smallest NNDR for synthetic vs. holdout. Ideally, this value should be close to 1, indicating that the synthetic samples are, in sparse as well as in dense regions, just as close to the training samples as to the holdout samples.
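The two summary metrics can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the report's actual computation: it assumes Euclidean distance on already-encoded numeric rows, defines NNDR as the distance to the nearest reference row divided by the distance to the second nearest, and counts ties as half. The function names and toy data are hypothetical.

```python
import math


def nearest_two(row, reference):
    """Return the two smallest Euclidean distances from row to reference rows."""
    distances = sorted(math.dist(row, ref) for ref in reference)
    return distances[0], distances[1]


def dcr_share(synthetic, training, holdout):
    """Share of synthetic rows closer to training than to holdout."""
    closer = 0.0
    for row in synthetic:
        dcr_train, _ = nearest_two(row, training)
        dcr_holdout, _ = nearest_two(row, holdout)
        if dcr_train < dcr_holdout:
            closer += 1.0
        elif dcr_train == dcr_holdout:
            closer += 0.5  # assumption: a tie counts as half
    return closer / len(synthetic)


def nndr_values(synthetic, reference):
    """Sorted ratios of nearest to second-nearest distance for each row."""
    values = []
    for row in synthetic:
        d1, d2 = nearest_two(row, reference)
        values.append(d1 / d2 if d2 > 0 else 1.0)
    return sorted(values)


def nndr_ratio(synthetic, training, holdout, k=10):
    """Ratio of the k-th smallest NNDR vs. training to that vs. holdout."""
    nndr_train = nndr_values(synthetic, training)[k - 1]
    nndr_holdout = nndr_values(synthetic, holdout)[k - 1]
    return nndr_train / nndr_holdout


# Hypothetical one-dimensional toy data, placed symmetrically on purpose.
training = [(0.0,), (20.0,)]
holdout = [(4.0,), (24.0,)]
synthetic = [(1.0,), (3.0,), (21.0,), (23.0,)]

share = dcr_share(synthetic, training, holdout)
ratio = nndr_ratio(synthetic, training, holdout, k=2)  # k=2: only 4 rows
```

In this toy setup, two of the four synthetic rows are closer to training and two are closer to holdout, so the share is 50% and the ratio is 1. A model that merely copied its training rows would push the share toward 100%.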